COMPOSITE QUANTILE REGRESSION AND THE ORACLE
MODEL SELECTION THEORY

By Hui Zou and Ming Yuan 
University of Minnesota and Georgia Institute of Technology 

Coefficient estimation and variable selection in multiple linear re- 
gression is routinely done in the (penalized) least squares (LS) frame- 
work. The concept of model selection oracle introduced by Fan and 
Li [J. Amer. Statist. Assoc. 96 (2001) 1348-1360] characterizes the 
optimal behavior of a model selection procedure. However, the least- 
squares oracle theory breaks down if the error variance is infinite. In 
the current paper we propose a new regression method called com- 
posite quantile regression (CQR). We show that the oracle model 
selection theory using the CQR oracle works beautifully even when 
the error variance is infinite. We develop a new oracular procedure 
to achieve the optimal properties of the CQR oracle. When the er- 
ror variance is finite, CQR still enjoys great advantages in terms of 
estimation efficiency. We show that the relative efficiency of CQR 
compared to the least squares is greater than 70% regardless of the
error distribution. Moreover, CQR could be much more efficient and
sometimes arbitrarily more efficient than the least squares. The same 
conclusions hold when comparing a CQR-oracular estimator with a 
LS-oracular estimator. 

1. Introduction and motivation. 

1.1. Background. In recent years, various techniques have been devel- 
oped for simultaneous variable selection and coefficient estimation in mul- 
tiple linear regression. Notable methods include the nonnegative garrote 
[Breiman (1995)], the lasso [Tibshirani (1996)] and the SCAD [Fan and Li 
(2001)]. Fan and Li (2006) gave a comprehensive overview of recent advances 
in variable selection. 
Fan and Li (2001) introduced the concept of model selection oracle to
guide the construction of optimal model selection procedures. To elaborate,
consider the following linear model

(1.1)  $y = x^T \beta^* + \varepsilon$.

Without loss of generality we center the predictors. Denote $\mathcal{A} = \{j : \beta^*_j \neq 0\}$.
The problem of variable selection and coefficient estimation is to identify
the unknown set $\mathcal{A}$ and estimate the corresponding coefficients $\beta^*_{\mathcal{A}}$, using
n independent samples generated from the model (1.1). To understand the
optimality of variable selection and coefficient estimation, Fan and Li (2001)
suggested considering the oracle who knows the true subset $\mathcal{A}$. The oracle
would only need to estimate $\beta^*_{\mathcal{A}}$ and set $\hat\beta_{\mathcal{A}^c} = 0$. It is worth emphasizing
here that although the oracle knows the true subset model, the error
distribution remains unknown. Fan and Li (2001) considered the oracle estimator
which estimates $\beta^*_{\mathcal{A}}$ by least squares, which we shall refer to as the LS-oracle.
Denote by X the design matrix and assume that $\lim_{n\to\infty} \frac{1}{n} X^T X = C$, where
C is a p x p positive definite matrix. Write $C_{\mathcal{A}\mathcal{A}}$ for the sub-matrix of C with
both row and column indices in $\mathcal{A}$. We have

(1.2)  $\sqrt{n}\,(\hat\beta^{LS}(\mathrm{oracle})_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \sigma^2 C_{\mathcal{A}\mathcal{A}}^{-1})$,

where $\sigma^2$ is the variance of $\varepsilon$.

Note that the oracle "estimator" is not a legitimate estimator because 
it uses the information of A, which is unavailable in practice. Nevertheless, 
the oracular model selection theory provides a gold standard for evaluating
variable selection and coefficient estimation procedures. Following
Fan and Li (2001), we say a variable selection and coefficient estimation
procedure $\eta$ is a LS-oracular estimator, if $\hat\beta(\eta)$ (asymptotically) has the
following oracle properties:

• Consistent selection: $\Pr(\{j : \hat\beta(\eta)_j \neq 0\} = \mathcal{A}) \to 1$.

• Efficient estimation: $\sqrt{n}\,(\hat\beta(\eta)_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \sigma^2 C_{\mathcal{A}\mathcal{A}}^{-1})$.

Thus $\eta$ works as well as the LS-oracle. Fan and Li (2001) showed that the
SCAD indeed attains the oracle properties. Zou (2006) later demonstrated 
that the adaptive lasso also enjoys the oracle properties. 

1.2. Issues with the oracle. Ideally, if one knows the error distribution, 
then the best oracle is the maximum likelihood estimate knowing the true 
underlying sparse model. However, in the linear regression problems, the 
error distribution (hence the likelihood model) is typically unknown. Hence 
we can only consider a practical oracle procedure. Although Fan and Li 
(2001) treated the (penalized) least squares as a special case in a general 
(penalized) likelihood based framework when the noise follows a normal
distribution, we should note that the LS-oracle model selection theory does
not need the normal error assumption, as long as the error distribution
has a finite variance. However, the finite variance assumption is crucial for
the oracle model selection theory based on the least squares. The reason
is simple. If the error variance is infinite, $\hat\beta^{LS}(\mathrm{oracle})$ is no longer a root-n
consistent estimator and cannot serve as an ideal method for variable
selection and coefficient estimation. Consequently, estimators that are shown 
to enjoy the oracle properties, such as the SCAD or the adaptive lasso, 
will also lose their root-n consistency. Such situations occur, for example, 
when the error follows a Cauchy distribution. On the other hand, model 
selection is about discovering the sparse structure in the relation between 
the response and predictors; thus model selection is still a legitimate and 
interesting problem even when the error variance is infinite. See Example 5 
in Section 5. 

The limitation of the LS-oracle motivates our work here. We wish to find
an alternative oracle that can overcome the breakdown issue of the LS-oracle.
There are several important considerations when designing a new oracle
estimator. Let $\hat\beta^{new}(\mathrm{oracle})$ be a new oracle estimator. Firstly, the new
oracle estimator should be root-n consistent and enjoy asymptotic normality
even when the LS-oracle fails to do so. Secondly, we are interested in
the relative efficiency of the new oracle estimator $\hat\beta^{new}(\mathrm{oracle})$ with respect
to $\hat\beta^{LS}(\mathrm{oracle})$ when $\sigma^2 < \infty$. Since $\hat\beta^{LS}(\mathrm{oracle})$ is of full efficiency when
the error follows a normal distribution, it is impossible to have an oracle
that is universally more efficient than the LS-oracle. However, it would
be very nice to have the relative efficiency of $\hat\beta^{new}(\mathrm{oracle})$ with respect to
$\hat\beta^{LS}(\mathrm{oracle})$ be bounded from below. This will prevent severe loss of statistical
efficiency even in the worst scenario. Furthermore, we would like to see
that $\hat\beta^{new}(\mathrm{oracle})$ can be significantly more efficient than $\hat\beta^{LS}(\mathrm{oracle})$ for
commonly used nonnormal error distributions. Finally, the oracle estimator
needs to be attainable in the sense that we have an estimating procedure
that can mimic $\hat\beta^{new}(\mathrm{oracle})$, like the SCAD mimics $\hat\beta^{LS}(\mathrm{oracle})$.

It is not a trivial task to find an oracle estimator that satisfies all the 
above properties. For instance, the least absolute value regression is an ob- 
vious alternative to the least squares. Even for Cauchy-distributed errors, 
the least absolute value regression estimator still enjoys the asymptotic nor- 
mality. The oracle estimator by the least absolute value regression is also 
attainable by the SCAD [see Fan and Li (2001), page 1357]. However, the 
relative efficiency of the least absolute value regression can be arbitrarily 
small when compared with the least squares. Therefore, we do not consider 
it as a safe alternative to the least squares. 
1.3. Our contributions. In this work we introduce a new regression method
called composite quantile regression (CQR) that can be used to construct an 
oracle estimator possessing all of the aforementioned properties. We define 
the CQR in Section 2. In Section 3 we study the asymptotic relative efficiency 
of the CQR-oracle with respect to the LS-oracle. A universal lower bound is 
derived, which shows that the relative efficiency is always larger than 70%. 
Moreover, we show by several concrete examples that the CQR-oracle can 
be much more efficient than the LS-oracle. In Section 4 we propose to use 
the adaptive lasso penalty to construct the adaptively penalized CQR esti- 
mator which is shown to achieve the performance of the CQR-oracle if the 
penalization parameter is appropriately chosen. Simulation results are pre- 
sented in Section 5. The technical proofs are presented in Section 6. Section 7 
contains a few concluding remarks. 

2. Composite quantile regression. To motivate CQR, let us briefly review
the quantile regression method [Koenker (2005)]. Note that the conditional
$100\tau\%$ quantile of $y|x$ is

$\sum_{j=1}^{p} x_j \beta^*_j + b^*_\tau$,

where $b^*_\tau$ is the $100\tau\%$ quantile of $\varepsilon$. For brevity, we shall assume that the
density function of $\varepsilon$ is nonvanishing everywhere. Therefore $b^*_\tau$ is uniquely
defined for any $0 < \tau < 1$. Quantile regression estimates $\beta^*$ by solving

(2.1)  $(\hat b_\tau, \hat\beta^{QR_\tau}) = \arg\min_{b,\beta} \sum_{i=1}^{n} \rho_\tau\Big(y_i - b - \sum_{j=1}^{p} x_{ij}\beta_j\Big)$,

where $\rho_\tau(t) = \tau t_+ + (1-\tau) t_-$ is the so-called check function where
subscripts + and - stand for the positive and negative parts, respectively.
Quantile regression has been widely used in various areas such as economics
[Koenker and Hallock (2001)] and survival analysis [Koenker and Geling (2001)]
among others. It is well known that under mild regularity conditions [Koenker
(2005)],

(2.2)  $\sqrt{n}\,(\hat\beta^{QR_\tau} - \beta^*) \to_d N\Big(0, \frac{\tau(1-\tau)}{f^2(b^*_\tau)}\, C^{-1}\Big)$,

where f denotes the density function of $\varepsilon$.

Quantile regression can be more efficient than the least squares estimator. In
particular, if $\varepsilon$ follows a double-exponential distribution, $\hat\beta^{QR_{0.5}}$ is the most
efficient estimator. $\hat\beta^{QR_\tau}$ does not require $\sigma^2 < \infty$ in order to enjoy the
root-n consistency and asymptotic normality, as opposed to the LS estimator.
However, the relative efficiency of the quantile regression estimator with
respect to the LS estimator can be arbitrarily small.
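To make the check function concrete, the following is a minimal sketch (our own illustration, not code from the paper) showing that minimizing the check loss over a constant recovers a sample quantile. The helper names `check_loss` and `best_constant` and the toy sample are assumptions made for this example.

```python
def check_loss(t, tau):
    """Check function rho_tau(t) = tau * t_+ + (1 - tau) * t_-."""
    return tau * t if t >= 0 else (tau - 1.0) * t


def best_constant(values, tau):
    """Return the data point b that minimizes sum_i rho_tau(values[i] - b).

    The objective is piecewise linear and convex in b, so a minimizer
    can always be found among the data points themselves.
    """
    return min(values, key=lambda b: sum(check_loss(v - b, tau) for v in values))


# Toy sample with one gross outlier (made-up numbers).
sample = [2.0, -1.0, 0.5, 7.0, 3.0, 100.0, 1.0, 4.0, 0.0]
median_fit = best_constant(sample, 0.5)    # minimizer of the tau = 0.5 loss
upper_fit = best_constant(sample, 0.75)    # minimizer of the tau = 0.75 loss
```

With nine observations, `median_fit` is the sample median and `upper_fit` is the seventh order statistic; the outlier at 100 does not move either of them.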
To further improve upon the usual quantile regression, we propose to si-
multaneously consider multiple quantile regression models. Note that the 
regression coefficients are the same across different quantile regression mod- 
els. We shall demonstrate that by combining the strength across multiple 
quantile regression models, we can derive a good estimator that satisfies the 
desired properties discussed in the introduction. 

Denote $0 < \tau_1 < \tau_2 < \cdots < \tau_K < 1$. We consider estimating $\beta^*$ as follows

(2.3)  $(\hat b_1, \ldots, \hat b_K, \hat\beta^{CQR}) = \arg\min_{b_1,\ldots,b_K,\beta} \sum_{k=1}^{K} \Big\{ \sum_{i=1}^{n} \rho_{\tau_k}(y_i - b_k - x_i^T\beta) \Big\}$.

We call it composite quantile regression, for the objective function in (2.3) is a
mixture of the objective functions from different quantile regression models.
Typically, we use the equally spaced quantiles: $\tau_k = \frac{k}{K+1}$ for k = 1, 2, . . . , K.
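As an illustration of the criterion (2.3), here is a minimal sketch for a single predictor. It assumes a brute-force grid search over the slope in place of a proper solver, and uses the fact that, for a fixed slope, each intercept $b_k$ can be optimized separately over the residuals. The function names and the toy data (an exact line with one gross outlier) are made up for this example.

```python
def check_loss(t, tau):
    """Check function rho_tau(t)."""
    return tau * t if t >= 0 else (tau - 1.0) * t


def profiled_cqr_loss(beta, taus, xs, ys):
    """Composite quantile loss (2.3) for one predictor, minimized over b_1..b_K.

    For fixed beta, b_k only enters the k-th inner sum, which is convex and
    piecewise linear, so its minimum is attained at one of the residuals.
    """
    residuals = [y - x * beta for x, y in zip(xs, ys)]
    total = 0.0
    for tau in taus:
        total += min(sum(check_loss(r - b, tau) for r in residuals) for b in residuals)
    return total


def ls_slope(xs, ys):
    """Ordinary least squares slope (with intercept), for comparison."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


K = 4
taus = [k / (K + 1) for k in range(1, K + 1)]   # equally spaced quantile levels
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-3.0, -1.0, 50.0, 3.0, 5.0, 7.0]          # y = 2x + 1, except an outlier at x = 0
grid = [i / 100 for i in range(0, 401)]         # candidate slopes 0.00, 0.01, ..., 4.00
beta_cqr = min(grid, key=lambda b: profiled_cqr_loss(b, taus, xs, ys))
beta_ls = ls_slope(xs, ys)
```

On this toy data the composite quantile fit keeps the slope of the underlying line, while the least squares slope is pulled away by the single outlier. The grid search is only practical in one dimension; it is used here to keep the sketch dependency-free.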

We now establish the asymptotic normality of $\hat\beta^{CQR}$. The following two
regularity conditions are assumed throughout the rest of our discussions:

(1) There is a p x p positive definite matrix C such that

$\lim_{n\to\infty} \frac{1}{n} X^T X = C$.

(2) $\varepsilon$ has cumulative distribution function $F(\cdot)$ and density function $f(\cdot)$.
For each p-vector u,

$\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \int_0^{u_0 + x_i^T u} \sqrt{n}\,[F(a + t/\sqrt{n}) - F(a)]\,dt = \frac{1}{2} f(a)\,(u_0, u^T) \begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} (u_0, u^T)^T$.

Conditions (1)-(2) are basically the same conditions for establishing the
asymptotic normality of a single quantile regression [Koenker (2005)]. Under
these conditions, we have the following result for the CQR estimate.

Theorem 2.1 (The limiting distribution). Under the regularity conditions
(1) and (2), the limiting distribution of $\sqrt{n}\,(\hat\beta^{CQR} - \beta^*)$ is $N(0, \Sigma_{CQR})$,
where

$\Sigma_{CQR} = C^{-1}\, \frac{\sum_{k,k'=1}^{K} \min(\tau_k,\tau_{k'})(1 - \max(\tau_k,\tau_{k'}))}{\big(\sum_{k=1}^{K} f(b^*_{\tau_k})\big)^2}$.
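The scalar factor multiplying $C^{-1}$ in Theorem 2.1 is easy to evaluate numerically. The sketch below is our own illustration: it assumes standard normal errors and uses `statistics.NormalDist` from the Python standard library; the function name `cqr_variance_factor` is hypothetical.

```python
from statistics import NormalDist


def cqr_variance_factor(taus, density, quantile):
    """Scalar multiplying C^{-1} in Sigma_CQR of Theorem 2.1."""
    numerator = sum(min(s, t) * (1.0 - max(s, t)) for s in taus for t in taus)
    denominator = sum(density(quantile(t)) for t in taus) ** 2
    return numerator / denominator


std_normal = NormalDist(0.0, 1.0)
K = 19
taus = [k / (K + 1) for k in range(1, K + 1)]

# K = 19 equally spaced quantiles under N(0, 1) errors.
factor_cqr = cqr_variance_factor(taus, std_normal.pdf, std_normal.inv_cdf)

# K = 1 with tau = 0.5 reduces to the single quantile regression factor
# tau (1 - tau) / f^2(b_tau) of (2.2), which equals pi / 2 here.
factor_median = cqr_variance_factor([0.5], std_normal.pdf, std_normal.inv_cdf)
```

Under these assumptions the least squares factor is $\sigma^2 = 1$, so `factor_cqr` being only slightly above 1 reflects the small efficiency loss of CQR for normal errors, while `factor_median` shows the larger loss of median regression.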
3. Asymptotic relative efficiency. In this section, we investigate the asymptotic
relative efficiency (ARE) of the CQR with respect to the least squares.
The same results can be applied to compute the relative efficiency of the
CQR-oracle with respect to the LS-oracle.

Note that when $\sigma^2 < \infty$, the asymptotic variance of the least squares is
$\sigma^2 C^{-1}$. Therefore, the ARE of the CQR with respect to the least squares is

(3.1)  $\mathrm{ARE}(\tau_1,\ldots,\tau_K, f) = \frac{\sigma^2 \big(\sum_{k=1}^{K} f(b^*_{\tau_k})\big)^2}{\sum_{k,k'=1}^{K} \min(\tau_k,\tau_{k'})(1 - \max(\tau_k,\tau_{k'}))}$.

We define the CQR-oracle estimator as follows

(3.2)  $(\hat b_1, \ldots, \hat b_K, \hat\beta^{CQR}(\mathrm{oracle})_{\mathcal{A}}) = \arg\min_{b_1,\ldots,b_K,\beta} \sum_{k=1}^{K} \Big\{ \sum_{i=1}^{n} \rho_{\tau_k}\Big(y_i - b_k - \sum_{j\in\mathcal{A}} x_{ij}\beta_j\Big) \Big\}$,

and $\hat\beta^{CQR}(\mathrm{oracle})_{\mathcal{A}^c} = 0$. By Theorem 2.1 we have

(3.3)  $\sqrt{n}\,(\hat\beta^{CQR}(\mathrm{oracle})_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \Sigma_{CQRoracle})$,

where

$\Sigma_{CQRoracle} = C_{\mathcal{A}\mathcal{A}}^{-1}\, \frac{\sum_{k,k'=1}^{K} \min(\tau_k,\tau_{k'})(1 - \max(\tau_k,\tau_{k'}))}{\big(\sum_{k=1}^{K} f(b^*_{\tau_k})\big)^2}$.

For the LS-oracle we have

(3.4)  $\sqrt{n}\,(\hat\beta^{LS}(\mathrm{oracle})_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \sigma^2 C_{\mathcal{A}\mathcal{A}}^{-1})$.

Therefore, the asymptotic relative efficiency (ARE) of the CQR-oracle with
respect to the LS-oracle is also equal to $\mathrm{ARE}(\tau_1, \ldots, \tau_K, f)$ given in (3.1).

Take $\tau_k = \frac{k}{K+1}$ and write $\mathrm{ARE}(K, f) = \mathrm{ARE}(\tau_1, \ldots, \tau_K, f)$. It turns
out that as K approaches infinity $\mathrm{ARE}(K, f)$ converges to a limit, denoted
by $\delta(f)$. The next theorem gives us the explicit expression of $\delta(f)$ and
provides a universal lower bound to $\delta(f)$.

Theorem 3.1 (The universal lower bound).

$\lim_{K\to\infty} \frac{\sum_{k,k'=1}^{K} \min(\tau_k,\tau_{k'})(1 - \max(\tau_k,\tau_{k'}))}{\big(\sum_{k=1}^{K} f(b^*_{\tau_k})\big)^2} = \frac{1}{12\,(E_\varepsilon[f(\varepsilon)])^2}$

and

$\delta(f) = \lim_{K\to\infty} \mathrm{ARE}(K, f) = 12\sigma^2 (E_\varepsilon[f(\varepsilon)])^2$.

Denote by $\mathcal{F}$ the collection of all density functions that satisfy condition (2)
and have a finite variance. We have

$\inf_{f\in\mathcal{F}} \delta(f) > \frac{6}{e\pi} = 0.7026$.

Although $\delta(f)$ explicitly depends on $\sigma^2$, it is actually scale-invariant. We
should also point out that the lower bound 70.26% given above is conservative.
For commonly used error distributions in practice, $\delta$ is often much
larger than the lower bound. Having the lower bound is a very useful property.
It prevents severe loss in efficiency when using the CQR estimator
instead of the LS estimator. Even in the worst possible scenario, the potential
loss in efficiency is less than 30%. Meanwhile, CQR can have a big gain
in efficiency compared to the LS estimator, if the error follows certain types
of error distributions as shown in the following examples.

We now calculate $\delta$ for some commonly used distributions.

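Example 1 (Normal distribution). Suppose the error density is $f(\varepsilon) = \frac{1}{\sqrt{2\pi}} e^{-\varepsilon^2/2}$.
Then the least squares is the most efficient. We calculate

$E_\varepsilon[f(\varepsilon)] = \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\varepsilon^2}\,d\varepsilon = \frac{1}{2\sqrt{\pi}}$.

Thus $\delta = \frac{3}{\pi} = 0.955$. In other words, the CQR is almost as efficient as the
least squares in this case.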

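Example 2 (Double exponential distribution). The density function is
$f(\varepsilon) = \frac{1}{2} e^{-|\varepsilon|}$. We compute

$E_\varepsilon[f(\varepsilon)] = \int_{-\infty}^{\infty} \frac{1}{4} e^{-2|\varepsilon|}\,d\varepsilon = \frac{1}{4}$.

The error variance is 2. Hence Theorem 3.1 says $\delta = 2 \cdot 12 \cdot \frac{1}{16} = 1.5$.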

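Example 3 (Logistic distribution). The density function of the logistic
distribution is $f(\varepsilon) = \frac{e^{\varepsilon}}{(1+e^{\varepsilon})^2}$. We compute

$E_\varepsilon[f(\varepsilon)] = \int_{-\infty}^{\infty} \frac{e^{2\varepsilon}}{(1+e^{\varepsilon})^4}\,d\varepsilon = \int_0^{\infty} \frac{s}{(1+s)^4}\,ds = \frac{1}{6}$.

By Theorem 3.1 we know

$\lim_{K\to\infty} \Sigma_{CQR} = 3C^{-1}$.

On the other hand, the Fisher information of the logistic distribution is $\frac{1}{3}$.
Therefore the CQR can asymptotically achieve the information bound if
the error follows the logistic distribution. Moreover, the variance of the logistic
distribution is $\frac{\pi^2}{3}$. Thus the relative efficiency is $\frac{\pi^2}{3} \cdot 12 \cdot \frac{1}{36} = \frac{\pi^2}{9} = 1.097$.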
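The three values of $\delta$ in Examples 1-3 can be cross-checked numerically from the formula $\delta(f) = 12\sigma^2 (E_\varepsilon[f(\varepsilon)])^2$ of Theorem 3.1, using $E_\varepsilon[f(\varepsilon)] = \int f^2$. The sketch below is our own check, not part of the paper; the Simpson rule, the integration range [-60, 60] and all function names are choices made for this illustration.

```python
import math


def simpson(func, lo, hi, n=20000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0


def delta(density, variance, lo=-60.0, hi=60.0):
    """delta(f) = 12 * sigma^2 * (E[f(eps)])^2, with E[f(eps)] = integral of f^2."""
    expected_density = simpson(lambda e: density(e) ** 2, lo, hi)
    return 12.0 * variance * expected_density ** 2


def normal_pdf(e):
    return math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)


def laplace_pdf(e):
    return 0.5 * math.exp(-abs(e))


def logistic_pdf(e):
    # Written with exp(-|e|) so that large |e| cannot overflow; the density is even.
    z = math.exp(-abs(e))
    return z / (1.0 + z) ** 2


delta_normal = delta(normal_pdf, 1.0)                    # Example 1
delta_laplace = delta(laplace_pdf, 2.0)                  # Example 2
delta_logistic = delta(logistic_pdf, math.pi ** 2 / 3)   # Example 3
```

The same `delta` helper can be applied to any density with a known finite variance, provided the integration range covers the bulk of $f^2$.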




Fig. 1. (A): The relative efficiency ($\delta$) as a function of the degrees of freedom of the
T-distribution. The dotted horizontal line indicates $\delta = 1$, while the solid line indicates
$\delta = 0.955$. (B): The relative efficiency ($\delta$) as a function of r in the mixture of normals
distribution. The dotted horizontal line indicates $\delta = 1$. (C): The relative efficiency ($\delta$) as
a function of $\alpha$ in the mixture of double gamma distribution.

Example 4 (T-distribution). Let us also consider the T-distribution,
which is often used to model errors following a heavy-tailed distribution.

Corollary 3.1. For the T-distribution with degrees of freedom v > 2,

$\delta = \frac{12}{\pi}\, \frac{1}{v-2} \Big(\frac{\Gamma((v+1)/2)}{\Gamma(v/2)}\Big)^4 \Big(\frac{\Gamma(v+1/2)}{\Gamma(v+1)}\Big)^2$.

For v = 3, $\delta$ is 1.9. From panel (A) in Figure 1, we see that the relative
efficiency is greater than 1 for small degrees of freedom. It is also interesting
to see that for large degrees of freedom the relative efficiency is very close to
0.955. This is expected because the T-distribution converges to the normal
as $v \to \infty$.
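The expression in Corollary 3.1 can be evaluated directly. The following sketch is our own illustration; it works on the log scale with `math.lgamma` so that large degrees of freedom do not overflow, and the function name `delta_t` is hypothetical.

```python
import math


def delta_t(v):
    """delta for the T-distribution with v > 2 degrees of freedom (Corollary 3.1)."""
    if v <= 2:
        raise ValueError("the error variance is finite only for v > 2")
    log_ratio_a = math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
    log_ratio_b = math.lgamma(v + 0.5) - math.lgamma(v + 1)
    return 12.0 / math.pi / (v - 2) * math.exp(4 * log_ratio_a + 2 * log_ratio_b)


# Relative efficiency for a few degrees of freedom.
deltas = {v: delta_t(v) for v in (3, 5, 10, 30, 1000)}
```

The entries of `deltas` decrease from about 1.9 at v = 3 toward the normal value 3/pi as v grows, matching the description of panel (A).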



Example 5 (A mixture of two normals). We have seen that for
normally distributed errors, $\delta$ is 0.955. It turns out that we can let $\delta$ be
arbitrarily large by slightly perturbing the normal distribution, while keeping
the error variance bounded.

Corollary 3.2. Suppose the error follows a distribution of the mixture
of normals

$\varepsilon \sim (1-r)\,N(0,1) + r\,N(0, r^6)$

for 0 < r < 1. Then

$\delta = \frac{3}{\pi}\,(1 - r + r^7)\Big((1-r)^2 + \frac{1}{r} + \frac{2\sqrt{2}\,r(1-r)}{\sqrt{1+r^6}}\Big)^2$.

Panel (B) of Figure 1 shows the curve of $\delta$ as a function of r. When r is
close to 0, the error variance approaches 1, but $\delta \approx \frac{3}{\pi}\big(1 + \frac{1}{r}\big)^2 \to \infty$.
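To see how quickly $\delta$ grows as r shrinks, the following sketch evaluates $12\sigma^2 (E_\varepsilon[f(\varepsilon)])^2$ for the mixture in Corollary 3.2. It is our own illustration and assumes that $N(0, r^6)$ denotes a normal law with variance $r^6$; the function name is hypothetical.

```python
import math


def delta_normal_mixture(r):
    """delta for eps ~ (1 - r) N(0, 1) + r N(0, r^6), 0 < r < 1.

    Assumes r^6 is the variance of the second component, so that
    sigma^2 = 1 - r + r^7 and the integral of f^2 is bracket / (2 sqrt(pi)).
    """
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    variance = 1.0 - r + r ** 7
    bracket = ((1.0 - r) ** 2 + 1.0 / r
               + 2.0 * math.sqrt(2.0) * r * (1.0 - r) / math.sqrt(1.0 + r ** 6))
    return 3.0 / math.pi * variance * bracket ** 2


deltas = {r: delta_normal_mixture(r) for r in (0.5, 0.1, 0.01)}
```

Under this reading, a 1% contamination (r = 0.01) already gives a relative efficiency in the thousands, even though the error variance stays close to 1.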

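Example 6 (A mixture of two double Gamma distributions). We say $\varepsilon$
follows a double Gamma distribution with parameter $\alpha$ if

$f(\varepsilon) = \frac{1}{2\Gamma(\alpha+1)}\, |\varepsilon|^{\alpha} e^{-|\varepsilon|}$.

The double exponential distribution is a special double Gamma distribution
using $\alpha = 0$.

Corollary 3.3. Consider a mixture of double gamma distributions as
follows:

$\varepsilon \sim e^{-\alpha}\,\frac{1}{2} e^{-|\varepsilon|} + (1 - e^{-\alpha})\,\frac{1}{2\Gamma(\alpha+1)}\, |\varepsilon|^{\alpha} e^{-|\varepsilon|}$,

where $\alpha > 0$. Then $\delta$ is equal to

$12\big(2e^{-\alpha} + (1 - e^{-\alpha})(\alpha+1)(\alpha+2)\big) \Big(\frac{e^{-2\alpha}}{4} + \frac{e^{-\alpha}(1 - e^{-\alpha})}{2^{\alpha+1}} + \frac{(1 - e^{-\alpha})^2\,\Gamma(2\alpha+1)}{2^{2\alpha+2}\,\Gamma^2(\alpha+1)}\Big)^2$.

Using Stirling's formula [Feller (1968)] we have $\frac{\Gamma(2\alpha+1)}{2^{2\alpha+2}\,\Gamma^2(\alpha+1)} \sim \frac{1}{4\sqrt{\pi\alpha}}$. Thus
we can show that for $\alpha \to \infty$, $\delta \sim \frac{3\alpha}{4\pi}$. Displayed in panel (C) of Figure 1 is
the curve of $\delta$ as a function of $\alpha$.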
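The expression in Corollary 3.3 and its large-$\alpha$ behavior can be checked numerically. The sketch below is our own illustration; it evaluates the Gamma ratio on the log scale with `math.lgamma`, and the function name is hypothetical.

```python
import math


def delta_double_gamma_mixture(alpha):
    """delta for the mixture of double gamma distributions in Corollary 3.3."""
    w = math.exp(-alpha)   # weight on the double exponential component
    variance = 2.0 * w + (1.0 - w) * (alpha + 1.0) * (alpha + 2.0)
    # Gamma(2a + 1) / (2^(2a + 2) * Gamma(a + 1)^2), computed on the log scale
    log_term = (math.lgamma(2.0 * alpha + 1.0)
                - (2.0 * alpha + 2.0) * math.log(2.0)
                - 2.0 * math.lgamma(alpha + 1.0))
    expected_density = (w ** 2 / 4.0
                        + w * (1.0 - w) / 2.0 ** (alpha + 1.0)
                        + (1.0 - w) ** 2 * math.exp(log_term))
    return 12.0 * variance * expected_density ** 2


delta_14 = delta_double_gamma_mixture(14.0)
# Ratio of delta to the asymptotic form 3 * alpha / (4 * pi) for a large alpha.
ratio_large = delta_double_gamma_mixture(1000.0) / (3.0 * 1000.0 / (4.0 * math.pi))
```

At $\alpha = 0$ the mixture collapses to the double exponential distribution and the function returns 1.5, in agreement with Example 2; the approach to $3\alpha/(4\pi)$ is slow, so the asymptotic form is only a rough guide for moderate $\alpha$.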
[Figure 2. Recovered panel titles: Logistic; T with 3 degrees of freedom; Double Exponential. Horizontal axis: K.]

Fig. 2. The approximation accuracy is measured by the ratio $\frac{\mathrm{ARE}(K,f)}{12\sigma^2 (E_\varepsilon[f(\varepsilon)])^2}$. We see that
the ratio is almost 1 for each K in 9, 10, . . . , 29. The dotted line in each panel indicates 1.



In the above discussion we have considered the limit of the relative efficiency
when $K \to \infty$. Empirically, we have found that for a reasonably
large K, $\mathrm{ARE}(K, f)$ is already very close to its limit. The ratio $\frac{\mathrm{ARE}(K,f)}{12\sigma^2 (E_\varepsilon[f(\varepsilon)])^2}$
measures the approximation accuracy. As can be seen from Figure 2, the
ratio is very close to 1 for $K \ge 9$ in all the four different distributions considered
there. In practice, it seems that K = 19 is a good choice, which
amounts to using the 5%, 10%, 15%, . . . , 95% quantiles.
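For logistic errors the finite-K efficiency in (3.1) has a simple form, because $F(b) = 1/(1+e^{-b})$ gives $f(b^*_\tau) = \tau(1-\tau)$. The sketch below is our own illustration (the function name is hypothetical); it computes the ratio of $\mathrm{ARE}(K, f)$ to its limit $\pi^2/9$ for a few values of K.

```python
import math


def are_logistic(K):
    """ARE(K, f) from (3.1) for logistic errors with tau_k = k / (K + 1)."""
    taus = [k / (K + 1) for k in range(1, K + 1)]
    variance = math.pi ** 2 / 3.0
    # For the logistic law, the density at the tau-quantile is tau * (1 - tau).
    density_sum = sum(t * (1.0 - t) for t in taus)
    denom = sum(min(s, t) * (1.0 - max(s, t)) for s in taus for t in taus)
    return variance * density_sum ** 2 / denom


limit = math.pi ** 2 / 9.0   # delta(f) for the logistic distribution
ratios = {K: are_logistic(K) / limit for K in (1, 9, 19, 29)}
```

In this special case the ratio works out to $1 - 1/(K+1)^2$, so K = 9 is already within 1% of the limit and K = 19 within 0.25%, while K = 1 (median regression) reaches only 75% of it.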

4. The CQR-oracular estimator. The oracle model selection theory of 
Fan and Li (2001) contains two parts. The first part defines an optimal oracle 
estimator and the second part creates a practical procedure to achieve the 
optimal properties of the oracle. Following Fan and Li (2001), we say an 
estimation procedure $\zeta$ is a CQR-oracular estimator, if $\hat\beta(\zeta)$ (asymptotically)
has the following two properties:

• Consistent selection: $\Pr(\{j : \hat\beta(\zeta)_j \neq 0\} = \mathcal{A}) \to 1$.

• Efficient estimation: $\sqrt{n}\,(\hat\beta(\zeta)_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \Sigma_{CQRoracle})$.

Adaptive penalization methods have been successfully used to produce LS- 
oracular estimators. Fan and Li (2001) proposed the SCAD-penalized least 
squares and proved its oracle properties. Zou (2006) proposed the adaptive 
lasso and proved its oracle properties. In this section we present a CQR- 
oracular estimator and prove its oracle properties. 

We adopt the adaptive lasso idea from Zou (2006). Suppose we first fit
the CQR estimator using all the predictors. Theorem 2.1 says that $\hat\beta^{CQR}$ is
root-n consistent. Then we use $\hat\beta^{CQR}$ to construct the adaptively weighted
lasso penalty and consider the penalized CQR estimator as follows:

(4.1)  $(\hat b_1, \ldots, \hat b_K, \hat\beta^{ACQR}) = \arg\min_{b_1,\ldots,b_K,\beta} \sum_{k=1}^{K} \Big\{ \sum_{i=1}^{n} \rho_{\tau_k}(y_i - b_k - x_i^T\beta) \Big\} + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\hat\beta_j^{CQR}|^2}$.

We show that the adaptive lasso penalized CQR estimator (ACQR) enjoys
the oracle properties of the CQR-oracle.
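The effect of the adaptive weights is easiest to see by evaluating the penalized criterion at two candidate coefficient vectors. The sketch below is our own illustration: it only evaluates the objective (it does not solve the minimization), and the toy data, the initial estimate `beta_cqr` and the value of `lam` are made-up numbers.

```python
def check_loss(t, tau):
    """Check function rho_tau(t)."""
    return tau * t if t >= 0 else (tau - 1.0) * t


def acqr_objective(beta, intercepts, taus, X, y, beta_cqr, lam):
    """Composite quantile loss plus the adaptively weighted L1 penalty of (4.1).

    beta_cqr is an initial (unpenalized) CQR estimate; its entries are
    assumed to be nonzero so that the weights 1 / |beta_cqr_j|^2 are finite.
    """
    loss = 0.0
    for b, tau in zip(intercepts, taus):
        for row, yi in zip(X, y):
            fitted = b + sum(xj * bj for xj, bj in zip(row, beta))
            loss += check_loss(yi - fitted, tau)
    penalty = lam * sum(abs(bj) / abs(init) ** 2 for bj, init in zip(beta, beta_cqr))
    return loss + penalty


X = [[1, 0], [0, 1], [1, 1], [2, -1]]
y = [3, 0, 3, 6]                     # y = 3 * x1 exactly; x2 is irrelevant
taus = [0.25, 0.5, 0.75]
intercepts = [0.0, 0.0, 0.0]
beta_cqr = [2.9, 0.05]               # hypothetical initial CQR estimate
lam = 0.1

sparse = acqr_objective([3.0, 0.0], intercepts, taus, X, y, beta_cqr, lam)
dense = acqr_objective([3.0, 0.05], intercepts, taus, X, y, beta_cqr, lam)
```

Because the initial estimate of the second coefficient is small, its weight $1/|0.05|^2 = 400$ is large, and keeping even a tiny nonzero value for it makes `dense` much larger than `sparse`. This is the mechanism by which the penalty removes irrelevant predictors while penalizing the large coefficient only lightly.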

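Theorem 4.1 (Oracle properties). Assume the two regularity conditions
in Theorem 2.1. If $\lambda/\sqrt{n} \to 0$ and $\lambda \to \infty$, then $\hat\beta^{ACQR}$ must satisfy

1. Consistency in selection: $\Pr(\{j : \hat\beta_j^{ACQR} \neq 0\} = \mathcal{A}) \to 1$.

2. Asymptotic normality: $\sqrt{n}\,(\hat\beta^{ACQR}_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \Sigma_{CQRoracle})$.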

We have two remarks.

1. The results in Section 3 can be directly applied to compare the efficiency
of the ACQR with any LS-oracular estimator. Let $\eta$ be a LS-oracular
estimator. Then $\sqrt{n}\,(\hat\beta(\eta)_{\mathcal{A}} - \beta^*_{\mathcal{A}}) \to_d N(0, \sigma^2 C_{\mathcal{A}\mathcal{A}}^{-1})$, if $\sigma^2 < \infty$. The relative
efficiency of the ACQR with respect to $\eta$ is $\mathrm{ARE}(\tau_1, \tau_2, \ldots, \tau_K, f)$ in (3.1).
Therefore, the relative efficiency of the ACQR compared to $\eta$ is always
larger than 0.70 and can greatly exceed 1 for some error distributions.

2. We can also use the SCAD penalty in (4.1) and the resulting estimator
should also possess the oracle properties of the CQR-oracle. We
choose the adaptive lasso penalty only for computational convenience.
Note that similar to ordinary quantile regression, computing $\hat\beta^{ACQR}$
is equivalent to a linear programming problem. Thus we can efficiently
compute the ACQR estimator using a standard linear program solver.
5. A simulation study. In this section we use simulation to compare the
LS-oracle and the CQR-oracle and examine the performance of the ACQR
with finite samples. Our simulated data consist of a training set and an
independent validation set. Models were fitted on training data only, and the
validation data were used to select the tuning parameters. We simulated 100
data sets, each consisting of 100 training observations and 100 validation
observations, from the model

$y = x^T \beta + \varepsilon$,

where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)$ and the predictors $(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)$
follow a multivariate normal distribution $N(0, \Sigma_x)$ with $(\Sigma_x)_{ij} = 0.5^{|i-j|}$ for
$1 \le i, j \le 8$. This regression model was considered in Tibshirani (1996) and
Fan and Li (2001). Here we considered five different error distributions.
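One way to generate data from this design is sketched below. This is our own illustration, not the authors' simulation code: it uses an AR(1) recursion to obtain the covariance $0.5^{|i-j|}$, defaults to normal errors, and assumes that the second argument of N(0, 3) is a variance. The function name and the seed handling are hypothetical.

```python
import math
import random


def simulate_dataset(n, beta, rho=0.5, noise_sd=math.sqrt(3.0), seed=0):
    """Draw n observations from y = x^T beta + eps with Cov(x_i, x_j) = rho^|i-j|.

    Assumption: normal errors with standard deviation noise_sd
    (variance 3 by default).
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    X, y, noise = [], [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(len(beta) - 1):
            # AR(1) step: keeps unit variance and gives correlation rho^|i-j|
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        eps = rng.gauss(0.0, noise_sd)
        X.append(row)
        noise.append(eps)
        y.append(sum(b * x for b, x in zip(beta, row)) + eps)
    return X, y, noise


beta = [3, 1.5, 0, 0, 2, 0, 0, 0]
X_train, y_train, eps_train = simulate_dataset(100, beta, seed=1)
X_valid, y_valid, eps_valid = simulate_dataset(100, beta, seed=2)
```

Other error distributions can be substituted by replacing the line that draws `eps`; the training and validation sets use different seeds so that they are independent.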

Example 1. $\varepsilon \sim N(0, 3)$.

Example 2. $\varepsilon = \sigma\varepsilon^*$ where $\varepsilon^*$ follows the mixture of normal distribution
as in Corollary 3.2 with r = 0.5. We let $\sigma = \sqrt{6}$.

Example 3. $\varepsilon = \sigma\varepsilon^*$ where $\varepsilon^*$ follows the mixture of double
gamma distributions as in Corollary 3.3 with $\alpha = 14$. We let $\sigma = \frac{1}{9}$.

Example 4. The error distribution is T-distribution with 3 degrees of 
freedom. 

Example 5. The error distribution is Cauchy. 

We used the quantiles $\tau_k = \frac{k}{20}$ for k = 1, 2, . . . , 19 in the CQR-oracle and
the ACQR. The model error is computed by

$\mathrm{ME} = E[(\hat\beta - \beta)^T \Sigma_x (\hat\beta - \beta)]$.

We use the notation (NC, NIC) to denote the variable selection result,
where NC denotes the number of predictors in $\{x_1, x_2, x_5\}$ that have nonzero
coefficients, and NIC denotes the number of predictors in $\{x_3, x_4, x_6, x_7, x_8\}$
that have nonzero coefficients. Table 1 shows the average
model errors and variable selection results over 100 replications. In the
asymptotic sense, the LS-oracle is the best in Example 1, while the CQR-oracle
works better in Examples 2-4. The numerical experiments agree with
the theory. We also see that the model error of the ACQR is close to that
of the CQR-oracle. Table 1 shows that the ACQR does an excellent job in
variable selection.
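For a single fitted coefficient vector, the quantity inside the expectation in ME and the pair (NC, NIC) can be computed as in the following sketch. This is our own illustration: the function names and the example estimate `beta_hat` are made up, and averaging over replications is left to the caller.

```python
def model_error(beta_hat, beta, rho=0.5):
    """(beta_hat - beta)^T Sigma_x (beta_hat - beta) with (Sigma_x)_ij = rho^|i-j|."""
    d = [bh - b for bh, b in zip(beta_hat, beta)]
    p = len(d)
    return sum(d[i] * rho ** abs(i - j) * d[j] for i in range(p) for j in range(p))


def selection_counts(beta_hat, beta, tol=0.0):
    """Return (NC, NIC): selected relevant predictors and selected irrelevant ones."""
    nc = sum(1 for bh, b in zip(beta_hat, beta) if b != 0 and abs(bh) > tol)
    nic = sum(1 for bh, b in zip(beta_hat, beta) if b == 0 and abs(bh) > tol)
    return nc, nic


beta = [3, 1.5, 0, 0, 2, 0, 0, 0]
beta_hat = [2.9, 1.6, 0.0, 0.1, 2.0, 0.0, 0.0, 0.0]   # made-up estimate for illustration
me = model_error(beta_hat, beta)
counts = selection_counts(beta_hat, beta)
```

Here the made-up estimate keeps all three relevant predictors and wrongly retains one irrelevant predictor, so `counts` is (3, 1); an average of such pairs over replications gives entries of the form reported in Table 1.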

Example 5 is different from Examples 1-4, because the error distribution 
has infinite variance in Example 5. The LS-oracle is not the optimal estima- 
tor in this case. The simulation confirmed the theory. The model error of 
the LS-oracle is more than 2500. The CQR-oracle and the ACQR still work 
very well in Example 5. 
Table 1
Simulation results

                                 Example 1   Example 2   Example 3   Example 4   Example 5
Model error         LS-oracle    0.079       0.104       0.091       0.082       2788
                    CQR-oracle   0.085       0.033       0.043       0.060       0.134
                    ACQR         0.112       0.046       0.048       0.077       0.174
Variable selection  ACQR         (3, 0.53)   (3, 0.21)   (3, 0.23)   (3, 0.39)   (3, 0.53)

6. Proofs. 



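Proof of Theorem 2.1.

Let $\sqrt{n}\,(\hat\beta^{CQR} - \beta^*) = u_n$ and $\sqrt{n}\,(\hat b_k - b^*_{\tau_k}) = u_{n,k}$. Then $(u_{n,1}, \ldots, u_{n,K}, u_n)$
is the minimizer of the following criterion:

$L_n = \sum_{k=1}^{K} \sum_{i=1}^{n} \Big[ \rho_{\tau_k}\Big(\varepsilon_i - b^*_{\tau_k} - \frac{u_k + x_i^T u}{\sqrt{n}}\Big) - \rho_{\tau_k}(\varepsilon_i - b^*_{\tau_k}) \Big]$.

By the identity [Knight (1998)]

$|r - s| - |r| = -s\,(I(r > 0) - I(r < 0)) + 2\int_0^s [I(r \le t) - I(r \le 0)]\,dt$,

we have

$\rho_\tau(r - s) - \rho_\tau(r) = s\,(I(r < 0) - \tau) + \int_0^s [I(r \le t) - I(r \le 0)]\,dt$.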
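The identity for $\rho_\tau(r - s) - \rho_\tau(r)$ can be verified numerically. The sketch below is our own check; it uses a closed form for the integral of the indicator difference and excludes r = 0, where the identity can fail (an event of probability zero for a continuous error distribution). The function names are hypothetical.

```python
def check_loss(t, tau):
    """Check function rho_tau(t)."""
    return tau * t if t >= 0 else (tau - 1.0) * t


def knight_integral(r, s):
    """Closed form of the integral over t from 0 to s of I(r <= t) - I(r <= 0)."""
    if s >= 0:
        return s - r if 0 < r <= s else 0.0
    return r - s if s < r <= 0 else 0.0


def identity_gap(r, s, tau):
    """Absolute difference between the two sides of the identity."""
    lhs = check_loss(r - s, tau) - check_loss(r, tau)
    rhs = s * ((1.0 if r < 0 else 0.0) - tau) + knight_integral(r, s)
    return abs(lhs - rhs)


grid = [x / 4.0 for x in range(-12, 13) if x != 0]   # -3, -2.75, ..., 3 without 0
worst = max(identity_gap(r, s, tau)
            for r in grid for s in grid for tau in (0.1, 0.5, 0.9))
```

The largest gap over the grid is at the level of floating-point rounding, which is consistent with the identity holding for all r different from 0.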



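Thus we write $L_n$ as follows:

$L_n = \sum_{k=1}^{K} \sum_{i=1}^{n} \frac{u_k + x_i^T u}{\sqrt{n}}\,[I(\varepsilon_i < b^*_{\tau_k}) - \tau_k] + \sum_{k=1}^{K} \sum_{i=1}^{n} \int_0^{(u_k + x_i^T u)/\sqrt{n}} [I(\varepsilon_i \le b^*_{\tau_k} + t) - I(\varepsilon_i \le b^*_{\tau_k})]\,dt$

$\quad\;\; = \sum_{k=1}^{K} z_{n,k} u_k + z_n^T u + \sum_{k=1}^{K} B_n^{(k)}$,

where

$z_{n,k} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} [I(\varepsilon_i < b^*_{\tau_k}) - \tau_k]$,

$z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i \Big[ \sum_{k=1}^{K} (I(\varepsilon_i < b^*_{\tau_k}) - \tau_k) \Big]$,

$B_n^{(k)} = \sum_{i=1}^{n} \int_0^{(u_k + x_i^T u)/\sqrt{n}} [I(\varepsilon_i \le b^*_{\tau_k} + t) - I(\varepsilon_i \le b^*_{\tau_k})]\,dt$.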

By the Cramér-Wold device and CLT, we know

$(z_{n,1}, \ldots, z_{n,K}, z_n^T)^T \to_d (z_1, \ldots, z_K, z^T)^T \sim N(0, \Sigma)$ for some $\Sigma$

and

$\sum_{k=1}^{K} z_{n,k} u_k + z_n^T u \to_d \sum_{k=1}^{K} z_k u_k + z^T u$.



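Moreover, we have

$E[B_n^{(k)}] = \sum_{i=1}^{n} \int_0^{(u_k + x_i^T u)/\sqrt{n}} [F(b^*_{\tau_k} + t) - F(b^*_{\tau_k})]\,dt$

$\quad = \frac{1}{n} \sum_{i=1}^{n} \int_0^{u_k + x_i^T u} \sqrt{n}\,[F(b^*_{\tau_k} + t/\sqrt{n}) - F(b^*_{\tau_k})]\,dt$

$\quad \to \frac{1}{2} f(b^*_{\tau_k})\,(u_k, u^T) \begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} (u_k, u^T)^T$

and

$\mathrm{Var}[B_n^{(k)}] = \sum_{i=1}^{n} E\Big[ \int_0^{(u_k + x_i^T u)/\sqrt{n}} \big( I(\varepsilon_i \le b^*_{\tau_k} + t) - I(\varepsilon_i \le b^*_{\tau_k}) - [F(b^*_{\tau_k} + t) - F(b^*_{\tau_k})] \big)\,dt \Big]^2$

$\quad \le \sum_{i=1}^{n} E\Big[ \Big| \int_0^{(u_k + x_i^T u)/\sqrt{n}} \big( I(\varepsilon_i \le b^*_{\tau_k} + t) - I(\varepsilon_i \le b^*_{\tau_k}) - [F(b^*_{\tau_k} + t) - F(b^*_{\tau_k})] \big)\,dt \Big| \Big] \times 2\,\frac{|u_k + x_i^T u|}{\sqrt{n}}$

$\quad \le 4\,E[B_n^{(k)}]\, \frac{\max_{1 \le i \le n} |u_k + x_i^T u|}{\sqrt{n}} \to 0$.

Hence

$B_n^{(k)} \to_p \frac{1}{2} f(b^*_{\tau_k})\,(u_k, u^T) \begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} (u_k, u^T)^T$.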
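Thus it follows that

$L_n \to_d \sum_{k=1}^{K} z_k u_k + z^T u + \sum_{k=1}^{K} \frac{1}{2} f(b^*_{\tau_k})\,u_k^2 + \frac{1}{2} \Big( \sum_{k=1}^{K} f(b^*_{\tau_k}) \Big) u^T C u$.

Since $L_n$ is a convex function, then following Knight (1998) and Koenker
(2005) we have

$\sqrt{n}\,(\hat\beta^{CQR} - \beta^*) \to_d -\Big( \Big( \sum_{k=1}^{K} f(b^*_{\tau_k}) \Big) C \Big)^{-1} z$.

Moreover, the covariance matrix of z is

$\Sigma_z = C\,\mathrm{Var}\Big( \sum_{k=1}^{K} [I(\varepsilon < b^*_{\tau_k}) - \tau_k] \Big) = C \Big[ \sum_{k,k'=1}^{K} \min(\tau_k, \tau_{k'})(1 - \max(\tau_k, \tau_{k'})) \Big]$.

Therefore,

$\sqrt{n}\,(\hat\beta^{CQR} - \beta^*) \to_d N\Big( 0,\; C^{-1}\, \frac{\sum_{k,k'=1}^{K} \min(\tau_k, \tau_{k'})(1 - \max(\tau_k, \tau_{k'}))}{\big( \sum_{k=1}^{K} f(b^*_{\tau_k}) \big)^2} \Big)$.  □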



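Proof of Theorem 3.1. First, it is easy to check that if $\tau_k = \frac{k}{K+1}$,
then

$\frac{1}{K^2} \sum_{k,k'=1}^{K} \min(\tau_k, \tau_{k'})(1 - \max(\tau_k, \tau_{k'})) \to \frac{1}{12}$.

On the other hand,

$\frac{1}{K} \sum_{k=1}^{K} f(b^*_{\tau_k}) = \frac{1}{K} \sum_{k=1}^{K} f\Big( F^{-1}\Big( \frac{k}{K+1} \Big) \Big) \to \int_0^1 f(F^{-1}(s))\,ds = E_U[f(F^{-1}(U))]$,

where U ~ Unif(0, 1). Note that $F^{-1}(U)$ follows the distribution of $\varepsilon$, thus

$E_U[f(F^{-1}(U))] = E_\varepsilon[f(\varepsilon)]$.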
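To prove the lower bound, first we use Jensen's inequality

$E_\varepsilon[f(\varepsilon)] > \exp(E_\varepsilon[\log(f(\varepsilon))])$.

Let $g(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(\varepsilon - \mu)^2/(2\sigma^2)}$, where $\mu$ is the mean of the error distribution.
Then by the entropy inequality we have

$E_\varepsilon\Big[ \log \frac{f(\varepsilon)}{g(\varepsilon)} \Big] \ge 0$.

On the other hand, we compute

$E_\varepsilon[\log(g(\varepsilon))] = E_\varepsilon\Big[ \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(\varepsilon - \mu)^2}{2\sigma^2} \Big] = \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}$.

Thus we have

$E_\varepsilon[f(\varepsilon)] > \exp\Big( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2} \Big)$

and

$\delta > 12\sigma^2 \Big( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-1/2} \Big)^2 = \frac{6}{\pi e} = 0.7026$.  □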



Proofs of Corollaries 3.1-3.3. The proofs are direct applications 
of Theorem 3.1. The details are given in a technical report of the paper 
[Zou and Yuan (2007)], thus omitted for the sake of space. □ 

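Proof of Theorem 4.1. We write $\lambda = \lambda_n$. Let $\sqrt{n}\,(\hat\beta^{ACQR} - \beta^*) = u_n$
and $\sqrt{n}\,(\hat b_k - b^*_{\tau_k}) = u_{n,k}$. Then $(u_{n,1}, \ldots, u_{n,K}, u_n)$ is the minimizer of the
following criterion:

$L_n = \sum_{k=1}^{K} \sum_{i=1}^{n} \Big[ \rho_{\tau_k}\Big(\varepsilon_i - b^*_{\tau_k} - \frac{u_k + x_i^T u}{\sqrt{n}}\Big) - \rho_{\tau_k}(\varepsilon_i - b^*_{\tau_k}) \Big] + \sum_{j=1}^{p} \frac{\lambda}{\sqrt{n}\,|\hat\beta_j^{CQR}|^2}\, \sqrt{n}\,\Big( \Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j| \Big)$.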

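Following the arguments in the proof of Theorem 2.1, we write $L_n$ as follows:

$L_n = \sum_{k=1}^{K} z_{n,k} u_k + z_n^T u + \sum_{k=1}^{K} B_n^{(k)} + \sum_{j=1}^{p} \frac{\lambda}{\sqrt{n}\,|\hat\beta_j^{CQR}|^2}\, \sqrt{n}\,\Big( \Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j| \Big)$.

If $\beta^*_j \neq 0$, then $|\hat\beta_j^{CQR}|^2 \to_p |\beta^*_j|^2$, and $\sqrt{n}\,(|\beta^*_j + u_j/\sqrt{n}| - |\beta^*_j|) \to u_j\,\mathrm{sgn}(\beta^*_j)$.
By Slutsky's theorem, $\frac{\lambda}{\sqrt{n}\,|\hat\beta_j^{CQR}|^2}\, \sqrt{n}\,(|\beta^*_j + u_j/\sqrt{n}| - |\beta^*_j|) \to_p 0$. If $\beta^*_j = 0$, then
$\sqrt{n}\,(|\beta^*_j + u_j/\sqrt{n}| - |\beta^*_j|) = |u_j|$, and $\frac{\lambda}{\sqrt{n}\,|\hat\beta_j^{CQR}|^2} \to_p \infty$. Therefore, we have

$\frac{\lambda}{\sqrt{n}\,|\hat\beta_j^{CQR}|^2}\, \sqrt{n}\,\Big( \Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j| \Big) \to_p W(\beta^*_j, u_j) = \begin{cases} 0, & \text{if } \beta^*_j \neq 0, \\ 0, & \text{if } \beta^*_j = 0 \text{ and } u_j = 0, \\ \infty, & \text{if } \beta^*_j = 0 \text{ and } u_j \neq 0. \end{cases}$

Thus it follows that

$L_n \to_d \sum_{k=1}^{K} z_k u_k + z^T u + \sum_{k=1}^{K} \frac{1}{2} f(b^*_{\tau_k})\,u_k^2 + \frac{1}{2} \Big( \sum_{k=1}^{K} f(b^*_{\tau_k}) \Big) u^T C u + \sum_{j=1}^{p} W(\beta^*_j, u_j)$.

Let us write $u = (u_1^T, u_2^T)^T$ where $u_1$ contains the first q elements of u, and
partition $u_n = (u_{1n}^T, u_{2n}^T)^T$ and $z = (z_1^T, z_2^T)^T$ in the same way. Using
the same arguments in Knight (1998) and Koenker (2005), we have

$u_{2n} \to_d 0$

and

$u_{1n} \to_d -\Big( \Big( \sum_{k=1}^{K} f(b^*_{\tau_k}) \Big) C_{\mathcal{A}\mathcal{A}} \Big)^{-1} z_1$, where $z_1 \sim N(0, \Sigma_{z_1})$ with

$\Sigma_{z_1} = C_{\mathcal{A}\mathcal{A}} \Big[ \sum_{k,k'=1}^{K} \min(\tau_k, \tau_{k'})(1 - \max(\tau_k, \tau_{k'})) \Big]$.

Therefore, the asymptotic normality is proven.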

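We now prove the consistent selection result. Let $\mathcal{A}_n = \{j : \hat\beta_j^{ACQR} \neq 0\}$.
$\forall j \in \mathcal{A}$, the asymptotic normality indicates $\Pr(j \in \mathcal{A}_n) \to 1$. Then it suffices
to show that $\forall j \notin \mathcal{A}$, $\Pr(j \in \mathcal{A}_n) \to 0$. We know $\Big| \frac{\rho_\tau(r_1) - \rho_\tau(r_2)}{r_1 - r_2} \Big| \le \max(\tau, 1 - \tau) < 1$.
If $j \in \mathcal{A}_n$, then we must have $\frac{\lambda}{|\hat\beta_j^{CQR}|^2} < K \sum_{i=1}^{n} |x_{ij}|$, since each of the K inner sums
in (4.1) contributes at most $\sum_{i=1}^{n} |x_{ij}|$ to the subgradient with respect to $\beta_j$. Thus we have

$\Pr(j \in \mathcal{A}_n) \le \Pr\Big( \frac{\lambda}{|\hat\beta_j^{CQR}|^2} < K \sum_{i=1}^{n} |x_{ij}| \Big)$.

But $\frac{1}{n} \sum_{i=1}^{n} |x_{ij}| \le \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_{ij}^2}$, which converges to a finite limit, whereas
$\frac{\lambda}{n\,|\hat\beta_j^{CQR}|^2} = \frac{\lambda}{|\sqrt{n}\,\hat\beta_j^{CQR}|^2} \to_p \infty$, thus $\Pr(j \in \mathcal{A}_n) \to 0$.  □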



7. Concluding remarks. Fan and Li (2001) introduced the concept of or- 
acle model estimator and proposed the SCAD method to achieve the oracle 
properties. Fan and Li (2001) showed that ideally if one knows the like- 
lihood model, then the oracle is the maximum likelihood estimate know- 
ing the true underlying sparse model, and the SCAD estimator is obtained 
by the penalized likelihood model using the SCAD penalty. However, in
the linear regression problems, the error distribution (hence the likelihood 
model) is typically unknown. Hence we can only consider a practical ora- 
cle procedure. Fan and Li (2001) showed that penalized least squares us- 
ing the SCAD penalty is a practical oracular estimator. Unfortunately, 
the LS-oracle and the SCAD procedure break down when the error dis- 
tribution has an infinite variance. The other oracular method, the adaptive 
lasso [Zou (2006)], has the same trouble, because it also mimics the LS- 
oracle. 

In this work we have proposed the composite quantile regression and 
proven its nice theoretical properties. We have shown that the oracle model 
selection theory of Fan and Li (2001) still works beautifully even for the 
cases where the error variance is infinite, as long as we replace the LS-oracle 
with the CQR-oracle. Compared with the LS-oracle, the CQR-oracle has 
two remarkably nice properties:

(1) Its relative efficiency is always larger than 70%. 

(2) In the Gaussian model the relative efficiency of the ACQR is 95.5%. 
With nonnormal errors, its relative efficiency could be arbitrarily large. 

Following the lines of Fan and Li (2001) and Zou (2006), we have devel- 
oped the adaptively penalized CQR method (ACQR) and proven the oracle 
properties of the ACQR. There is a family of penalty functions, including
the SCAD penalty, that can be used to create CQR-oracular estimators. We 
have used the adaptive lasso penalty only for its computational convenience. 

As pointed out by the associate editor, we might also consider a general
composite quantile regression problem by minimizing

(7.1)  $\sum_{i=1}^{n} \int_0^1 \rho_\tau(y_i - b_\tau - x_i^T\beta)\,w(\tau)\,d\tau$,

where the weight function $w(\tau)$ is a density function over (0, 1). The proposed
CQR criterion uses a discrete uniform distribution on $\{\frac{1}{K+1}, \ldots, \frac{K}{K+1}\}$. Although
the weight function $w(\tau)$ could be a continuous density function in
(7.1), it seems that we need to discretize the integral in order to numerically
compute the estimator. Hence, technically speaking, a discrete distribution
density is used to construct the weights. In this paper we have shown that
the discrete uniform distribution leads to an interesting estimator which enjoys
various nice properties. It would be interesting to see whether other
distributions could result in similar estimators. This is an open problem for
future research.